Supplementary materials for "Optimizing Information-theoretical Generalization Bound via Anisotropic Noise in SGLD"

Neural Information Processing Systems

The supplementary materials are organized as follows. The first lemma is a standard result characterizing the KL divergence between two Gaussian distributions. The proof is then completed by induction: specifically, letting A be an anti-symmetric matrix and using the fact that Eq. (12) holds for any anti-symmetric matrix, the proof of Lemma 9 is obtained by combining Lemma 10 and Lemma 11. For the proof of Lemma 2, the β-smoothness condition gives the required bound; taking the expectation of Eq. (21) with respect to W and applying Eq. (24) back to Eq. (23) completes the proof.
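
For reference, the standard closed form that such a lemma typically states (a well-known identity, written here in generic notation rather than the paper's own symbols) for d-dimensional Gaussians is:

    \mathrm{KL}\big(\mathcal{N}(\mu_1,\Sigma_1)\,\|\,\mathcal{N}(\mu_2,\Sigma_2)\big)
      = \frac{1}{2}\Big[\operatorname{tr}\!\big(\Sigma_2^{-1}\Sigma_1\big)
      + (\mu_2-\mu_1)^{\top}\Sigma_2^{-1}(\mu_2-\mu_1)
      - d + \ln\frac{\det\Sigma_2}{\det\Sigma_1}\Big]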



Solving SUDOKU with Binary Integer Linear Programming (BILP)

#artificialintelligence

Originally published on Towards AI. Sudoku is a logic-based puzzle that first appeared in the U.S. under the title "Number Place" in 1979 in the magazine Dell Pencil Puzzles & Word Games [6].
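
As a rough illustration of the BILP formulation the article's title refers to (a minimal sketch using the PuLP library; the variable layout and clue format are my own assumptions, not the article's code):

    import pulp

    # Binary variable x[r][c][v] = 1 iff cell (r, c) contains value v + 1.
    rows = cols = vals = range(9)
    prob = pulp.LpProblem("sudoku", pulp.LpMinimize)
    x = pulp.LpVariable.dicts("x", (rows, cols, vals), cat="Binary")
    prob += 0  # pure feasibility problem: constant objective, only constraints

    # Each cell holds exactly one value.
    for r in rows:
        for c in cols:
            prob += pulp.lpSum(x[r][c][v] for v in vals) == 1

    # Each value appears exactly once per row, column, and 3x3 box.
    for v in vals:
        for r in rows:
            prob += pulp.lpSum(x[r][c][v] for c in cols) == 1
        for c in cols:
            prob += pulp.lpSum(x[r][c][v] for r in rows) == 1
        for br in range(0, 9, 3):
            for bc in range(0, 9, 3):
                prob += pulp.lpSum(x[br + i][bc + j][v]
                                   for i in range(3) for j in range(3)) == 1

    # Fix the given clues (hypothetical input: a 5 in the top-left cell).
    clues = {(0, 0): 5}
    for (r, c), d in clues.items():
        prob += x[r][c][d - 1] == 1

    prob.solve()
    grid = [[next(v + 1 for v in vals if x[r][c][v].value() == 1)
             for c in cols] for r in rows]

The design choice is standard for Sudoku-as-BILP: 729 binary indicator variables and purely equality constraints, so any feasible point is a valid completed grid and no objective function is needed.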


Optimizing Information-theoretical Generalization Bounds via Anisotropic Noise in SGLD

Wang, Bohan, Zhang, Huishuai, Zhang, Jieyu, Meng, Qi, Chen, Wei, Liu, Tie-Yan

arXiv.org Machine Learning

Recently, the information-theoretical framework has been proven able to obtain non-vacuous generalization bounds for large models trained by Stochastic Gradient Langevin Dynamics (SGLD) with isotropic noise. In this paper, we optimize the information-theoretical generalization bound by manipulating the noise structure in SGLD. We prove that, under a constraint that guarantees low empirical risk, the optimal noise covariance is the square root of the expected gradient covariance when both the prior and the posterior are jointly optimized. This validates that the optimal noise is quite close to the empirical gradient covariance. Technically, we develop a new information-theoretical bound that enables such an optimization analysis. We then apply matrix analysis to derive the form of the optimal noise covariance. The presented constraint and results are validated by empirical observations.
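
Schematically, and with notation assumed here rather than taken from the paper itself, the anisotropic SGLD update and the optimality claim in the abstract can be written as:

    W_{t+1} = W_t - \eta_t \nabla \hat{L}(W_t) + \mathcal{N}\big(0,\, \sigma_t^2 \Sigma_t\big),
    \qquad
    \Sigma_t^{\ast} \;\propto\; \big(\mathbb{E}\,[\,\Sigma_g(W_t)\,]\big)^{1/2},

where \hat{L} denotes the empirical risk and \Sigma_g(W_t) the covariance of the stochastic gradient at W_t; the claim is that the optimal noise covariance \Sigma_t^{\ast} is (proportional to) the matrix square root of the expected gradient covariance.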